Logistic regression is a parametric approach to classification. It models the probability of the response as $p(X)=\frac{e^{\beta_0+\beta X}}{1+e^{\beta_0+\beta X}}$, where $\beta_0$ is the intercept and $\beta$ is the coefficient vector, and $\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta X}$ is the odds.
- Each $\beta_j$ represents the change in the log-odds for a one-unit increase in $X_j$.
- The conditional probability is $P(Y=1\mid X=x)=\frac{e^{\beta_0+x^T\beta}}{1+e^{\beta_0+x^T\beta}}$.
- The log-odds is also called the logit: $\log\!\left(\frac{p_1(x)}{1-p_1(x)}\right)=\log\!\left(\frac{p_1(x)}{p_0(x)}\right)=\beta_0+x^T\beta$.
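As a minimal sketch of these quantities (the data, function name, and coefficient values below are made up for illustration), the model probability, odds, and log-odds can be computed as:

```python
import numpy as np

def predict_proba(X, beta0, beta):
    """p(X) = exp(beta0 + X beta) / (1 + exp(beta0 + X beta))."""
    eta = beta0 + X @ beta              # linear predictor (the log-odds)
    return 1.0 / (1.0 + np.exp(-eta))   # equivalent sigmoid form of the same expression

# hypothetical example: 3 observations, 2 predictors
X = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8]])
beta0, beta = -0.25, np.array([0.6, 1.1])

p = predict_proba(X, beta0, beta)
odds = p / (1 - p)          # equals exp(beta0 + X beta)
log_odds = np.log(odds)     # equals beta0 + X beta, the logit
```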
We can use maximum likelihood to estimate the parameters. The likelihood function is $L(\beta_0,\beta)=\prod_{i=1}^{n}p(X_i)^{Y_i}\,(1-p(X_i))^{1-Y_i}$, where $Y_i$ is the binary response.
The log-likelihood is $\ell(\beta_0,\beta)=\sum_{i=1}^{n}\left[Y_i\log p(X_i)+(1-Y_i)\log(1-p(X_i))\right]=\sum_{i=1}^{n}\left[Y_i\log\frac{p(X_i)}{1-p(X_i)}+\log(1-p(X_i))\right]=\sum_{i=1}^{n}\left[Y_i(\beta_0+x_i^T\beta)+\log\!\left(1-\frac{e^{\beta_0+x_i^T\beta}}{1+e^{\beta_0+x_i^T\beta}}\right)\right]=\sum_{i=1}^{n}\left[Y_i(\beta_0+x_i^T\beta)-\log(1+e^{\beta_0+x_i^T\beta})\right]$.
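A sketch of this quantity using the last simplified form above, $\sum_i\left[Y_i(\beta_0+x_i^T\beta)-\log(1+e^{\beta_0+x_i^T\beta})\right]$ (function names are illustrative, not from the notes):

```python
import numpy as np

def log_likelihood(X, y, beta0, beta):
    """l(beta0, beta) = sum_i [ y_i*(beta0 + x_i^T beta) - log(1 + exp(beta0 + x_i^T beta)) ]."""
    eta = beta0 + X @ beta
    # np.logaddexp(0, eta) computes log(1 + exp(eta)) in a numerically stable way
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def negative_log_likelihood(X, y, beta0, beta):
    """The loss minimized later by gradient descent."""
    return -log_likelihood(X, y, beta0, beta)
```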
For inference, we use the Z-statistic based on the asymptotic properties of the MLE, where $Z=\frac{\hat\beta_j}{SE[\hat\beta_j]}$.
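The notes do not say how $SE[\hat\beta_j]$ is obtained; one standard choice (an assumption here, not from the source) is the square root of the diagonal of the inverse Fisher information evaluated at the MLE. A hedged sketch:

```python
import numpy as np

def z_statistics(X, beta_hat):
    """Z_j = beta_hat_j / SE[beta_hat_j], with SE from the inverse Fisher information.

    Assumes X already includes an intercept column and beta_hat is the fitted MLE.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))   # fitted probabilities p_i
    W = p * (1 - p)                             # diagonal weights p_i (1 - p_i)
    fisher_info = X.T @ (W[:, None] * X)        # X^T W X
    cov = np.linalg.inv(fisher_info)            # approximate covariance of beta_hat
    se = np.sqrt(np.diag(cov))                  # standard errors SE[beta_hat_j]
    return beta_hat / se
```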
We define the logistic loss as the negative log-likelihood of the model. Since the negative log-likelihood is convex, we can use gradient descent to find the optimal solution.
Gradient Descent
Assume $\beta_0=0$; then we have the following:
The log-likelihood is concave, so we minimize the negative log-likelihood $-\ell(\beta)=\sum_{i=1}^{n}\left[-y_i x_i^T\beta+\log(1+e^{x_i^T\beta})\right]$ by gradient descent.
The partial derivative at any $\beta$ is $\frac{\partial(-\ell(\beta))}{\partial\beta_j}=\sum_{i=1}^{n}\left[-y_i+\frac{e^{x_i^T\beta}}{1+e^{x_i^T\beta}}\right]x_{ij}$, which gives the update $\hat\beta^{(k+1)}=\hat\beta^{(k)}-\alpha\sum_{i=1}^{n}\left[-y_i+\frac{e^{x_i^T\hat\beta^{(k)}}}{1+e^{x_i^T\hat\beta^{(k)}}}\right]x_i$, where $\alpha$ is the learning rate. We iterate until a stopping criterion is met (a code sketch follows the list):
- $|\ell(\hat\beta^{(k+1)})-\ell(\hat\beta^{(k)})|$ is small enough to stop (e.g. $\le 10^{-6}$)
- $\|\hat\beta^{(k+1)}-\hat\beta^{(k)}\|_2$ is small, or $\|\hat\beta^{(k+1)}-\hat\beta^{(k)}\|_2/\|\hat\beta^{(k)}\|_2$ is small
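Putting the update rule and the first stopping criterion together, a minimal gradient-descent sketch (with $\beta_0=0$ as assumed above; the learning rate, tolerance, and iteration cap are illustrative choices, not values from the notes):

```python
import numpy as np

def fit_logistic_gd(X, y, alpha=0.01, tol=1e-6, max_iter=10_000):
    """Minimize the negative log-likelihood -l(beta) by gradient descent (beta0 = 0)."""
    n, d = X.shape
    beta = np.zeros(d)
    prev_nll = np.inf
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # e^{x_i^T beta} / (1 + e^{x_i^T beta})
        grad = X.T @ (p - y)                    # sum_i [-y_i + p_i] x_i
        beta -= alpha * grad                    # beta^{(k+1)} = beta^{(k)} - alpha * gradient
        # negative log-likelihood at the new beta, using log(1 + exp(.)) = logaddexp(0, .)
        nll = -np.sum(y * (X @ beta) - np.logaddexp(0.0, X @ beta))
        if abs(prev_nll - nll) <= tol:          # |l(beta^{(k+1)}) - l(beta^{(k)})| small enough
            break
        prev_nll = nll
    return beta
```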